Abstract

We show that unconverged stochastic gradient descent can be interpreted as a procedure that samples from a nonparametric variational approximate posterior distribution. This distribution is implicitly defined as the transformation of an initial distribution by a sequence of optimization updates. By tracking the change in entropy over this sequence of transformations during optimization, we form a scalable, unbiased estimate of the variational lower bound on the log marginal likelihood. We can use this bound to optimize hyperparameters instead of using cross-validation. This Bayesian interpretation of SGD suggests improved, overfitting-resistant optimization procedures, and gives a theoretical foundation for popular tricks such as early stopping and ensembling. We investigate the properties of this marginal likelihood estimator on neural network models.

1 Introduction 

In much of machine learning, the central computational challenge is optimization: we try to minimize some training-set loss with respect to a set of model parameters. If we treat the training loss as a negative log-posterior, this amounts to searching for a maximum a posteriori (MAP) solution. Paradoxically, over-zealous optimization can yield worse test-set results than incomplete optimization due to the phenomenon of over-training. A popular remedy to over-training is to invoke “early stopping”, in which optimization is halted based on the continually monitored performance of the parameters on a separate validation set. However, early stopping is both theoretically unsatisfying and incoherent from a research perspective: how can one rationally design better optimization methods if the goal is to achieve something “powerful but not too powerful”? A related trick is to ensemble the results from multiple optimization runs from different starting positions. Similarly, this must rely on imperfect optimization, since otherwise all optimization runs would reach the same optimum.

Figure 1: A series of variational distributions implicitly defined by gradient descent on the log-likelihood of the posterior. Intermediate distributions (green and blue) are implicitly defined by mapping each possible random initial parameter setting through many iterations of optimization. These distributions don’t have a fixed parametric shape, and will eventually concentrate around the mode.

We propose an interpretation of incomplete optimization in terms of variational Bayesian inference, and provide a simple method for estimating the marginal likelihood of the approximate posterior. Our starting point is a Bayesian posterior distribution for a potentially complicated model, in which there is an empirical loss that can be interpreted as a negative log likelihood and regularizers that have interpretations as priors. One might proceed with MAP inference, and perform an optimization to find the best parameters. The main idea of this paper is that such an optimization procedure, initialized according to some distribution that can be chosen freely, generates a sequence of distributions that are implicitly defined by the action of the optimization update rule on the previous distribution. We can treat these distributions as variational approximations to the true posterior distribution. A single optimization run for N iterations represents a draw from the Nth such distribution in the sequence. Figure 1 shows contours of these approximate distributions on an example posterior.

With this interpretation, the number of optimization iterations can be seen as a variational parameter, one that trades off fitting the data well against maintaining a broad (high entropy) distribution. Early stopping amounts to optimizing the variational lower bound (or an approximation based on a validation set) with respect to this variational parameter. Ensembling different random restarts can be viewed as taking independent samples from the variational posterior.

To establish whether this viewpoint is helpful in practice, we ask: can we efficiently estimate the marginal likelihood implied by unconverged optimization? We tackle this question in section 2. Specifically, for stochastic gradient descent (SGD), we show how to compute an unbiased estimate of a lower bound on the log marginal likelihood of each iteration’s implicit variational distribution. We also introduce an ‘entropy-friendly’ variant of SGD that maintains better-behaved implicit distributions.

We also ask whether model selection based on these marginal likelihood estimates picks models with good test-time performance. We give some experimental evidence in both directions in section 5. A related question is how closely the variational distributions implied by various optimization rules approximate the true posterior. We briefly address this question in section 6.

1.1 Contributions 

• We introduce a new interpretation of optimization algorithms as samplers from a variational distribution that adapts to the true posterior, eventually collapsing around its modes.

• We provide a scalable estimator for the entropy of these implicit variational distributions, allowing us to estimate a lower bound on the marginal likelihood of any model whose posterior is twice-differentiable, even on problems with millions of parameters and data points.

• In principle, this marginal likelihood estimator can be used for hyperparameter selection and early stopping without the need for a validation set. We investigate the performance of these estimators empirically on neural network models, and show that they have reasonable properties. However, further refinements are likely to be necessary before this marginal likelihood estimator is more practical than using a validation set.

2 Incomplete optimization as variational inference

Variational inference (Wainwright & Jordan, 2008) aims to approximate an intractable posterior distribution, $p(\theta|x)$, with another more tractable distribution, $q(\theta)$. The usual measure of the quality of the approximation is the Kullback-Leibler (KL) divergence from $q(\theta)$ to $p(\theta|x)$. This measure provides a lower bound on the marginal likelihood of the original model; applying Bayes’ rule to the definition of $\mathrm{KL}(q(\theta)\,\|\,p(\theta|x))$ gives the familiar inequality:

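$$\log p(x) \;\ge\; -\underbrace{\mathbb{E}_{q(\theta)}\left[-\log p(\theta, x)\right]}_{\text{Energy } E[q]} \;\underbrace{-\,\mathbb{E}_{q(\theta)}\left[\log q(\theta)\right]}_{\text{Entropy } S[q]} \;:=\; \mathcal{L}[q] \tag{1}$$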

Maximizing $\mathcal{L}[q]$, the variational lower bound on the marginal likelihood, with respect to $q$ minimizes $\mathrm{KL}(q(\theta)\,\|\,p(\theta|x))$, the KL divergence from $q$ to the true posterior, giving the closest approximation available within the variational family. A convenient side effect is that we also get a lower bound on $p(x)$, which can be used for model selection.

To perform variational inference, we require a family of distributions over which to maximize $\mathcal{L}[q]$. Consider a general procedure to minimize the energy $(-\log p(\theta, x))$ with respect to $\theta \in \mathbb{R}^D$. The parameters $\theta$ are initialized according to some distribution $q_0(\theta)$ and updated at each iteration according to a transition operation $T : \mathbb{R}^D \to \mathbb{R}^D$:

$$\theta_0 \sim q_0(\theta), \qquad \theta_{t+1} = T(\theta_t).$$

Our variational family consists of the sequence of distributions $q_0, q_1, q_2, \ldots$, where $q_t(\theta)$ is the distribution over $\theta_t$ generated by the above procedure. These distributions don’t have a closed form, but we can exactly sample from $q_t$ by simply running the optimizer for $t$ steps starting from a random initialization.
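The sampling procedure just described can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper name `sample_q_t`, the quadratic toy energy, and the step size are all assumptions chosen so that the result is easy to check.

```python
import numpy as np

def sample_q_t(transition, sample_q0, t, rng):
    """Draw one exact sample from q_t by applying `transition` t times to a draw from q_0."""
    theta = sample_q0(rng)
    for _ in range(t):
        theta = transition(theta)
    return theta

# Hypothetical toy setup: gradient descent on the energy 0.5 * theta^T A theta.
A = np.diag([1.0, 10.0])
step_size = 0.05
transition = lambda theta: theta - step_size * (A @ theta)
sample_q0 = lambda rng: rng.normal(size=2)  # q_0 is a standard normal

rng = np.random.default_rng(0)
# Each row is an independent sample from q_20.
samples = np.stack([sample_q_t(transition, sample_q0, 20, rng) for _ in range(2000)])
```

In this toy case each coordinate of $q_t$ is Gaussian with a standard deviation that shrinks geometrically, so the sample spread collapses much faster along the high-curvature direction.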

As shown in (1), $\mathcal{L}$ consists of an energy term and an entropy term. The energy term measures how well $q$ fits the data and the entropy term encourages the probability mass of $q$ to spread out, preventing overfitting. As optimization of $\theta$ proceeds from its $q_0$-distributed starting point, we can examine how $\mathcal{L}$ changes. The negative energy term grows, since the goal of the optimization is to reduce the energy. The entropy term shrinks because the optimization converges over time. Optimization thus generates a sequence of distributions that range from underfitting to overfitting, and the variational lower bound captures this tradeoff.

We cannot evaluate $\mathcal{L}[q_t]$ exactly, but we can obtain an unbiased estimator. Sampling $\theta_0$ from $q_0$ and then applying the transition operator $t$ times produces an exact sample $\theta_t$ from $q_t$, by definition. Since $\theta_t$ is an exact sample from $q_t(\theta)$, $\log p(\theta_t, x)$ is an unbiased estimator of the negative energy term of (1). The entropy term is trickier, since we do not have access to the density $q(\theta)$ directly. However, if we know the entropy of the initial distribution, $S[q_0(\theta)]$, then we can estimate $S[q_t(\theta)]$ by tracking the change in entropy at each iteration, calculated by the change of variables formula.

To compute how the volume shrinks or expands due to an iteration of the optimizer, we require access to the Jacobian of the optimizer’s transition operator, $J(\theta)$:

$$S[q_{t+1}] - S[q_t] = \mathbb{E}_{q_t(\theta_t)}\left[\log |J(\theta_t)|\right]. \tag{2}$$
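The change-of-variables step behind (2) can be spelled out as follows. This is a sketch that assumes $T$ is differentiable and bijective; the symbol $\theta' = T(\theta)$ is introduced here only for illustration.

$$q_{t+1}(\theta') = q_t(\theta)\,|J(\theta)|^{-1}, \qquad \theta' = T(\theta),$$

$$S[q_{t+1}] = -\mathbb{E}_{q_{t+1}(\theta')}\left[\log q_{t+1}(\theta')\right] = -\mathbb{E}_{q_t(\theta)}\left[\log q_t(\theta) - \log |J(\theta)|\right] = S[q_t] + \mathbb{E}_{q_t(\theta)}\left[\log |J(\theta)|\right].$$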

Note that this analysis assumes that the mapping $T$ is bijective. Combining these terms, we have an unbiased estimator of $\mathcal{L}$ at iteration $T$, based on the sequence of parameters, $\theta_0, \ldots, \theta_T$, from a single training run:

$$\mathcal{L}[q_T] \approx \underbrace{\log p(\theta_T, x)}_{\text{Energy}} + \underbrace{\sum_{t=0}^{T-1} \log |J(\theta_t)| + S[q_0]}_{\text{Entropy}}. \tag{3}$$

3 The entropy of stochastic gradient descent 

In this section, we give an unbiased estimate for the change 
in entropy caused by SGD updates. We’ll start with a naive 
method, then in section 3.1, we give an approximation that 
scales linearly with the number of parameters in the model. 

Stochastic gradient descent is a popular and effective optimization procedure with the following update rule:

$$\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t), \tag{4}$$

where $L(\theta)$ is the objective loss (or an unbiased estimator of it, e.g. using minibatches), for example $-\log p(\theta, x)$, and $\alpha$ is a ‘step size’ hyperparameter. Taking the Jacobian of this update rule gives the following unbiased estimator for the change in entropy at each iteration:

$$S[q_{t+1}] - S[q_t] \approx \log \left| I - \alpha H_t(\theta_t) \right| \tag{5}$$

where $H_t$ is the Hessian of $-\log p_t(\theta, x)$ with respect to $\theta$.

Note that the Hessian does not need to be positive definite or even non-singular. If some directions in $\theta$ have negative curvature, as on the crest of a hill, it just means that optimization near there spreads out probability mass, increasing the entropy. There are, however, restrictions on $\alpha$. If $\alpha\lambda_i = 1$ for any $i$, where $\lambda_i$ are the eigenvalues of $H_t$, then the change in entropy will be undefined (infinitely negative). This corresponds to a Newton-like update where multiple points collapse to the optimum in a single step, giving a distribution with zero variance in a particular direction. However, gradient descent is unstable anyway if $\alpha\lambda_{\max} > 2$, where $\lambda_{\max}$ is the largest eigenvalue of $H_t$. So if we choose a sufficiently conservative step size, such that $\alpha\lambda_{\max} < 1$, this situation should not arise. Algorithm 1 combines these steps into an algorithm that tracks the approximate entropy during optimization.

Algorithm 1 stochastic gradient descent with entropy estimate

input: Weight initialization scale $\sigma_0$, step size $\alpha$, twice-differentiable negative log-likelihood $L(\theta, t)$
initialize $\theta_0 \sim \mathcal{N}(0, \sigma_0^2 I_D)$
initialize $S_0 = \frac{D}{2}(1 + \log 2\pi) + D \log \sigma_0$
for $t = 1$ to $T$ do
    $S_t = S_{t-1} + \log |I - \alpha H_{t-1}|$    ▷ Update entropy
    $\theta_t = \theta_{t-1} - \alpha \nabla L(\theta_{t-1}, t)$    ▷ Update parameters
end for
output sample $\theta_T$, entropy estimate $S_T$
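As a concrete illustration of Algorithm 1, the following NumPy sketch tracks the entropy with an exact log-determinant, which is only practical in low dimensions. The function name `sgd_with_entropy`, the quadratic toy objective, and the reading of $\sigma_0$ as a standard deviation are assumptions made for this example; the gradient and Hessian are supplied by hand rather than by automatic differentiation.

```python
import numpy as np

def sgd_with_entropy(grad, hessian, dim, sigma0, step_size, num_iters, rng):
    """Run gradient descent from a Gaussian initialization while tracking the entropy estimate."""
    # Initial sample and the exact entropy of N(0, sigma0^2 I); sigma0 is a standard deviation.
    theta = sigma0 * rng.normal(size=dim)
    entropy = 0.5 * dim * (1.0 + np.log(2.0 * np.pi)) + dim * np.log(sigma0)
    for _ in range(num_iters):
        # Entropy update: log |det(I - alpha * H)| evaluated at the current parameters.
        jacobian = np.eye(dim) - step_size * hessian(theta)
        _, logabsdet = np.linalg.slogdet(jacobian)
        entropy += logabsdet
        # Parameter update.
        theta = theta - step_size * grad(theta)
    return theta, entropy

# Hypothetical toy objective: L(theta) = 0.5 * theta^T A theta, so the Hessian is constant.
A = np.diag([1.0, 4.0])
rng = np.random.default_rng(0)
theta_T, entropy_T = sgd_with_entropy(
    grad=lambda th: A @ th,
    hessian=lambda th: A,
    dim=2, sigma0=1.0, step_size=0.1, num_iters=10, rng=rng,
)
```

With a constant Hessian every step removes the same amount of entropy, so the final estimate equals the initial entropy plus ten times $\log 0.9 + \log 0.6$.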

So far, we have treated SGD as a deterministic procedure even though, as the name suggests, the gradient of the loss at each iteration may be replaced by a stochastic version. Our analysis of the entropy is technically valid if we fix the sequence of stochastic gradients to be the same for each optimization run, so that the only randomness comes from the parameter initialization. This is a tendentious argument, similar to arguing that a pseudorandom sequence of numbers has only as much entropy as its seed. However, if we do choose to randomize the gradient estimator differently for each training run (e.g. choosing different minibatches) then the expression for the change in entropy, Equation 5, remains valid as a lower bound on the change in entropy and the subsequent calculation of $\mathcal{L}$ remains a true lower bound on the log marginal likelihood.


3.1 Estimating the Jacobian in high dimensions 


The expression for the change in entropy given by (5) is impractical for large-scale problems since it requires an $O(D^3)$ determinant computation. Fortunately, we can make a good approximation using just one or two Hessian-vector products, which can usually be performed in $O(D)$ time using reverse-mode differentiation (Pearlmutter, 1994).

The idea is that since $\alpha\lambda_{\max}$ is small, the Jacobian is actually just a small perturbation to the identity and we can approximate its determinant using traces as follows:

$$\log |I - \alpha H| = \sum_{i=1}^{D} \log(1 - \alpha\lambda_i) \;\ge\; \sum_{i=1}^{D} \left[ -\alpha\lambda_i - (\alpha\lambda_i)^2 \right] \tag{6}$$

$$= -\alpha \mathrm{Tr}[H] - \alpha^2 \mathrm{Tr}[HH]. \tag{7}$$

Algorithm 2 linear-time estimate of log-determinant of Jacobian of one iteration of stochastic gradient descent
1: input: step size $\alpha$, current parameter vector $\theta$, twice-differentiable negative log-likelihood $L(\theta)$
2: initialize $r_0 \sim \mathcal{N}(0, I_D)$
3: $r_1 = r_0 - \alpha\, r_0^\top \nabla\nabla L(\theta, t)$
4: $r_2 = r_1 - \alpha\, r_1^\top \nabla\nabla L(\theta, t)$
5: $\hat{\mathcal{L}} = r_0^\top (-2 r_0 + 3 r_1 - r_2)$
6: output $\hat{\mathcal{L}}$, an unbiased estimate of a parabolic lower bound on the change in entropy.

The bound in (6) is a parabolic lower bound derived from the second-order Taylor expansion of $\log(1 - x)$ about $x = 0$, and is valid if $\alpha\lambda_i < 0.68$. As we argue above, the regime in which SGD is stable requires that $\alpha\lambda_{\max} < 1$, so again choosing a conservative learning rate keeps this bound in the correct direction. For sufficiently small learning rates, this bound becomes tight.
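The range of validity of the bound in (6) can be checked numerically. The sketch below is an illustration only: the grid, the random symmetric matrix, and the choice $\alpha\lambda_{\max} = 0.5$ are assumptions made for the example.

```python
import numpy as np

# Scalar check: log(1 - x) >= -x - x^2 on a grid of x values up to 0.68.
x = np.linspace(-5.0, 0.68, 2001)
exact = np.log1p(-x)
parabola = -x - x ** 2
bound_holds = bool(np.all(exact >= parabola))

# Just past the threshold the inequality flips direction.
fails_at_0p7 = bool(np.log1p(-0.7) < -0.7 - 0.7 ** 2)

# Matrix check on a hypothetical symmetric (possibly indefinite) Hessian.
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
H = 0.5 * (B + B.T)
alpha = 0.5 / np.max(np.abs(np.linalg.eigvalsh(H)))  # so every |alpha * lambda_i| <= 0.5
_, logdet = np.linalg.slogdet(np.eye(5) - alpha * H)
trace_bound = -alpha * np.trace(H) - alpha ** 2 * np.trace(H @ H)
```

Negative eigenvalues pose no problem here, which matches the earlier remark that the Hessian need not be positive definite.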

The trace of the Hessian can be estimated using inner products of random vectors (Bai et al., 1996):

$$\mathrm{Tr}[H] = \mathbb{E}\left[ r^\top H r \right], \qquad r \sim \mathcal{N}(0, I). \tag{8}$$

We use this identity to derive Algorithm 2. In high dimensions, the exact evaluation of the determinant in the entropy-update step of Algorithm 1 should be replaced with the approximation given by Algorithm 2.
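A minimal NumPy sketch of the estimator in Algorithm 2 follows. It assumes a user-supplied Hessian-vector product `hvp` (a hypothetical callable); here it is backed by an explicit toy matrix, whereas in a real model it would come from automatic differentiation. The probe vector is drawn with identity covariance, as required by (8).

```python
import numpy as np

def entropy_change_estimate(hvp, dim, step_size, rng):
    """One-sample estimate of -alpha * Tr[H] - alpha^2 * Tr[HH] using two Hessian-vector products."""
    r0 = rng.normal(size=dim)            # random probe vector, r0 ~ N(0, I)
    r1 = r0 - step_size * hvp(r0)        # (I - alpha H) r0
    r2 = r1 - step_size * hvp(r1)        # (I - alpha H)^2 r0
    return r0 @ (-2.0 * r0 + 3.0 * r1 - r2)

# Hypothetical toy Hessian, used only to check the estimator.
H = np.diag([1.0, 2.0, 3.0])
alpha = 0.1
rng = np.random.default_rng(0)
estimates = [entropy_change_estimate(lambda v: H @ v, 3, alpha, rng) for _ in range(20000)]
mean_estimate = float(np.mean(estimates))

parabolic_bound = -alpha * np.trace(H) - alpha ** 2 * np.trace(H @ H)
_, exact_logdet = np.linalg.slogdet(np.eye(3) - alpha * H)
```

Averaged over many probes, the estimate approaches the parabolic bound of (7), which in turn sits slightly below the exact log-determinant of (5).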

Note that the quantity we are estimating (5) is well-conditioned, in contrast to the related problem of computing the log of the determinant of the Hessian itself. This arises, for example, in making the Laplace approximation to the posterior (MacKay, 1992). This is a much harder problem since the Hessian can be arbitrarily ill-conditioned, unlike our small Hessian-based perturbation to the identity.

3.2 Parameter initialization, priors, and objective functions

What initial parameter distribution should we use for SGD? The marginal likelihood estimate given by (3) is valid no matter which initial distribution we choose. We could conceivably optimize this distribution in an outer loop using the marginal likelihood estimate itself.

However, using the prior distribution has several advantages. First, it is usually designed to have broader support than the likelihood. Since SGD usually decreases entropy, starting with a high-entropy distribution is a good heuristic.

The second advantage has to do with our choice of objective function. The obvious choice is the (unnormalized, negative) log-posterior, but we can actually use any function we like. A more sensible choice is the negative log-likelihood. With this choice, and with the prior as the initial distribution, the variational distributions only differ from the initial distribution to the extent that the posterior differs from the prior. One nice implication is that the entropy estimate will be exactly correct for parameters that don’t affect the likelihood. Because of these favorable properties, we use these choices for the initial distribution and objective in our experiments.

4 Designing entropy-friendly optimization methods

SGD optimizes the training loss, not the variational lower bound. In some sense, if this optimization happens to create a good variational distribution, it’s only by accident. Why not design a new optimization method that produces good variational lower bounds? In place of SGD, we can use any optimization method for which we can approximate the change in entropy, which in practice means any optimization for which we can compute Jacobian-vector products.

An obvious place to start is with stochastic update rules inspired by Markov Chain Monte Carlo (MCMC). Procedures like Hamiltonian Monte Carlo (Neal, 2011) and Langevin dynamics MCMC (Welling & Teh, 2011) look very much like optimization procedures but actually have the posterior as their stationary distribution. This is exactly the approach taken by Salimans et al. (2014). One difficulty with using stochastic updates, however, is that calculating the change in entropy at each iteration requires access to the current distribution over parameters. As an example, consider that convolving a delta function with a Gaussian yields an infinite entropy increase, whereas convolving a broad uniform distribution with a Gaussian yields only a small increase in entropy. Salimans et al. (2014) handle this by learning a highly parameterized “inverse model” which implicitly models the distribution over parameters. The downside of this approach is that the parameters of this model must be learned in an outer loop.

Another approach is to try to develop deterministic update rules that avoid some of the pathologies of update rules like SGD. This could be a research agenda in itself, but we give one example here of a modification to SGD which can improve the variational lower bound. One problem with SGD in the context of posterior approximation is that SGD can collapse the variational distribution into low-entropy filaments, shrinking in some directions to be orders of magnitude smaller than the width of the true posterior. A simple trick to prevent this is to apply a nonlinear, parameter-wise warping to the gradient, such that directions of very small gradient do not get optimized all the way to the optimum. For example, the modified gradient (and resulting modified Jacobian) could be

$$g' = g - g_0 \tanh(g / g_0) \tag{9}$$

$$J' = \left(1 - \cosh^{-2}(g / g_0)\right) J \tag{10}$$

where $g_0$ is a “gradient threshold” parameter that sets the scale of this shrinkage. The effect is that entropy is not removed from parameters which are close to their optimum. An example showing the effect of this entropy-friendly modification is shown in Figure 2.

Figure 2: The variational distribution implied by the modified, “entropy-friendly”, SGD algorithm. Compared to Figure 1, the variational distributions are slower to collapse into low-entropy filaments, causing the marginal likelihood to remain higher.
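The gradient warping in (9) and the elementwise factor appearing in (10) can be sketched as below. The function name `entropy_friendly_gradient` and the test values are assumptions for illustration; the finite-difference comparison simply confirms that the factor $1 - \cosh^{-2}(g/g_0)$ is the derivative of the warped gradient with respect to the raw gradient.

```python
import numpy as np

def entropy_friendly_gradient(g, g0):
    """Apply the parameter-wise warping g' = g - g0 * tanh(g / g0).

    Returns the warped gradient and the elementwise factor 1 - cosh(g / g0)^(-2),
    which is d g' / d g and rescales the Jacobian as in (10).
    """
    scaled = g / g0
    g_mod = g - g0 * np.tanh(scaled)
    jac_scale = 1.0 - 1.0 / np.cosh(scaled) ** 2
    return g_mod, jac_scale

# Hypothetical gradient values spanning small and large magnitudes.
g = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
g0 = 1.0
g_mod, jac_scale = entropy_friendly_gradient(g, g0)

# Central finite differences of the warped gradient with respect to g.
eps = 1e-6
fd = (entropy_friendly_gradient(g + eps, g0)[0]
      - entropy_friendly_gradient(g - eps, g0)[0]) / (2.0 * eps)
```

Gradients much smaller than $g_0$ are almost entirely suppressed, while gradients much larger than $g_0$ pass through nearly unchanged apart from a shift of $g_0$.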

5 Experiments 

In this section we show that the marginal likelihood estimate can be used to choose when to stop training, to choose model capacity, and to optimize training hyperparameters without the need for a validation set. We are not attempting to motivate SGD variational inference as a superior alternative to other procedures; we simply wish to give a proof of concept that the marginal likelihood estimator has reasonable properties. Further refinements are likely to be necessary before this marginal likelihood estimator is more practical than simply using a validation set.

5.1 Choosing when to stop optimization 

As a simple demonstration of the usefulness of our marginal likelihood estimate, we show that it can be used to estimate the optimal number of training iterations before overfitting begins. We performed regression on the Boston housing dataset using a neural network with one hidden layer having 100 hidden units, sigmoidal activation functions, and no regularization. Figure 3 shows overfitting and shows that marginal likelihood peaks at a similar place to the peak of held-out log-likelihood, which is where early stopping would occur when using a large validation set.



Figure 3: Top: Training and test-set error on the Boston housing dataset. Bottom: Stochastic gradient descent marginal likelihood estimates. The dashed line indicates the iteration with highest marginal likelihood. The marginal likelihood, estimated online using only the training set, and the test error peak at a similar number of iterations.
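The stopping rule described here amounts to picking the iteration at which the running estimate of (3) is largest. A minimal sketch follows; the function name `best_stopping_iteration` is hypothetical and the two traces are made-up numbers for illustration, not results from the paper.

```python
import numpy as np

def best_stopping_iteration(log_joint_trace, entropy_trace):
    """Return the iteration maximizing log p(theta_t, x) + S_t, plus the full trace of estimates."""
    lower_bounds = np.asarray(log_joint_trace) + np.asarray(entropy_trace)
    return int(np.argmax(lower_bounds)), lower_bounds

# Made-up traces: the log joint improves with training while the entropy estimate falls.
log_joint_trace = [-10.0, -6.0, -4.0, -3.5, -3.4]
entropy_trace = [0.0, -1.0, -2.5, -4.5, -7.0]
best_t, lower_bounds = best_stopping_iteration(log_joint_trace, entropy_trace)
```

Because both traces are available online during a single training run, this selection requires no held-out data.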

5.2 Choosing the number of hidden units 

The marginal likelihood estimate is also comparable between training runs, allowing us to use it to select model hyperparameters, such as the number of hidden units.

Figure 4 shows marginal likelihood estimates as a function of the number of hidden units in the hidden layer of a neural network trained on 50,000 MNIST handwritten digits. The largest network trained in this experiment contains 2 million parameters.

The marginal likelihood estimate begins to decrease for more than 30 hidden units, even though the test-set likelihood is maximized at 300 hidden units. We conjecture that this is due to the marginal likelihood estimate penalizing the loss of entropy in parameters whose contribution to the likelihood was initially large, but were made irrelevant later in the optimization.

Figure 4: Top: Training and test-set likelihood as a function of the number of hidden units in the first layer of a neural network. Bottom: Stochastic gradient descent marginal likelihood estimates. In this case, the marginal likelihood over-penalizes high numbers of hidden units.

5.3 Optimizing training hyperparameters

We can also use marginal likelihoods to optimize training parameters such as learning rates, initial distributions, or any other optimization parameters. As an example, Figure 5 shows the marginal likelihood estimate as a function of the gradient threshold in the entropy-friendly SGD algorithm from section 4 trained on 50,000 MNIST handwritten digits.

Figure 5: Top: Training and test-set likelihood as a function of the gradient threshold. Bottom: Marginal likelihood as a function of the gradient threshold. A gradient threshold of zero corresponds to standard SGD. The increased lower bound for non-zero thresholds indicates that the entropy-friendly variant of SGD is producing a better implicit variational distribution.

As the level of thresholding increases, the training and test error get worse due to under-fitting. However, for intermediate thresholds, the lower bound increases. Because it is a lower bound, its increase means that the estimate of the marginal likelihood is becoming more accurate, even though the actual model happens to be getting worse at the same time.

5.4 Implementation details

To allow easy computation of Hessian-vector products in arbitrary models, we implemented a reverse-mode automatic differentiation package for Python, available at github.com/HIPS/autograd. This package operates on standard Numpy (Oliphant, 2007) code, and can differentiate code containing loops, branches, and even its own gradient evaluations.

Code for all experiments in this paper is available at github.com/HIPS/maxwells-daemon.

6 Limitations


In practice, the marginal likelihood estimate we present might not be useful for several reasons. First, an estimate of both the expected likelihood and the entropy of an entire distribution that is based on only a single sample will necessarily have high variance under some circumstances. These problems could conceivably be addressed by ensembling, which has an interpretation as taking multiple exact independent samples from the implicit variational posterior.

Second, as parameters converge, their entropy estimate (and true entropy) will continue to decrease indefinitely, making the marginal likelihood arbitrarily small. However, in practice there is usually a limit to the degree of overfitting possible. This raises the question: when are marginal likelihoods a good guide to predictive accuracy? Presumably the marginal likelihood is more likely to be correlated with predictive performance when the implicit distribution has moderate amounts of entropy. In section 4 we modified SGD to be less prone to produce regions of pathologically low entropy, but a more satisfactory solution is probably possible.

Third, if the model includes a large number of parameters that do not affect the predictive likelihood, but which are still affected by a regularizer, their convergence will penalize the marginal likelihood estimate even though these parameters do not affect test set performance. This is why in section 3.2 we recommend optimizing only the log-likelihood, and incorporating the regularizer directly into the initialization procedure. More generally, however, entropy could be underestimated if a large group of parameters are initially constrained by the data, but are later “turned off” by some other parameters in the model.

Finally, how viable is optimization as an inference method? Standard variational methods find the best approximation in some class, but SGD doesn’t even try to produce a good approximate posterior, other than by seeking the modes. Indeed, Figure 1 shows that the distribution implied by SGD collapses to a small portion of the true posterior early on, and mainly continues to shrink as optimization proceeds. However, the point of early stopping is not that the intermediate distributions are particularly good approximations, but simply that they are better than the point masses that occur when optimization has converged.

7 Related work 

Estimators for early stopping Stein’s unbiased risk estimator (SURE) (Stein, 1981) provides an unbiased estimate of generalization performance under very broad conditions, and can be used to construct a stopping rule. Raskutti et al. (2014) derived a SURE estimate for SGD in a regression setting. Interestingly, this estimator depends on the ‘shrinkage matrix’ $\prod_{t} (I - \alpha_t H_t)$, which is just the Jacobian of the entire SGD procedure along a particular path. However, this estimator depends on an estimate of the noise variance, and is restricted to the i.i.d. regression setting. It’s also not clear if these stopping rules could also be used to select other training parameters or model hyperparameters.

Reversible learning Optimization is an intrinsically information-destroying process, since a (good) optimization procedure maps any initial starting point to one or a few final optima. We can quantify this loss of information by asking how many bits must be stored in order to reverse the optimization, as in Maclaurin et al. (2015). We can think of the number of bits needed to exactly reverse the optimization procedure as the average number of bits ‘learned’ during the optimization.

From this perspective, stopping before optimization converges can be seen as a way to limit the number of bits we try to learn about the parameters from the data. This is a reasonable strategy, since we don’t expect to be able to learn more than a finite number of bits from a finite dataset. This is also an example of reducing the hypothesis space to improve generalization.

MCMC for variational inference Our method can be seen as a special case of Salimans et al. (2014), who showed that any set of stochastic dynamics, even those not satisfying detailed balance, can be used to implicitly define a variational distribution. However, to provide a tight variational bound, one needs to estimate the entropy of the resulting implicit distribution. Salimans et al. (2014) do this by defining an inverse model which estimates backwards transition probabilities, and then optimizes this model in an outer loop. In contrast, our dynamics are deterministic, and our estimate of the entropy has a simple fixed form.

Bayesian neural networks Variational inference has been performed in Bayesian neural-network models (Graves, 2011; Hensman & Lawrence, 2014; Hernandez-Lobato & Adams, 2015). Kingma & Welling (2014) show how neural networks having unknown weights can be reformulated as neural networks having known weights but stochastic hidden units, and exploit this connection to perform efficient gradient-based inference in Bayesian neural networks.

Black-box stochastic variational inference Kucukelbir et al. (2014) introduce a general scheme for variational inference using only the gradients of the log-likelihood of a model. However, they constrain their variational approximation to be Gaussian, as opposed to our free-form variational distribution.

8 Future work and extensions 

Optimization with momentum One obvious extension would be to design an entropy estimator for momentum-based optimizers such as stochastic gradient descent with momentum, or refinements such as Adam (Kingma & Ba, 2014). However, it is difficult to track the entropy change during the updates to the momentum variables.

Gradient-based hyperparameter optimization Hyperparameters typically come in two forms: regularization parameters and training parameters. Optimizing marginal likelihood rather than training loss lets us set regularization parameters during training without using a validation set. The marginal likelihood estimate lets us optimize the variational parameters (training hyperparameters) in an outer loop. However, optimizing more than a few of these is difficult without gradients. We could gain access to exact gradients of the variational lower bound with respect to all variational parameters by simply using reverse-mode differentiation. Domke (2012) and Maclaurin et al. (2015) showed that this can be done in a memory-efficient way for momentum-based learning procedures. Combining these two procedures would allow one to set all hyperparameters using gradient-based methods without the need for a validation set.

Stochastic dynamics One possible method to deal with over-zealous reduction in entropy by SGD would be to add noise to the dynamics. In the case of Gaussian noise, we would recover Langevin dynamics (Neal, 2011). However, estimating the entropy becomes much more difficult in this case. Welling & Teh (2011) introduced stochastic gradient Langevin dynamics for doing inference with minibatches. Ma et al. (2013) use Langevin dynamics and a floating temperature to estimate partition functions of graphical models.

More generally, we are free to design optimization algorithms that do a better job of producing samples from the true posterior, as long as we can track their entropy. The gradient-thresholding method proposed in this paper is a simple first example of a refinement to SGD that maintains a tractable entropy estimate while improving the quality of the intermediate distributions.


9 Conclusion 

Optimization algorithms with random initializations implicitly define a series of distributions which converge to posterior modes. We showed that these nonparametric distributions can be seen as variational approximations to the true posterior. We showed how to produce an unbiased estimate of this variational lower bound by approximately tracking the entropy change at each step of optimization.

This simple and inexpensive calculation turns standard gradient descent into an inference algorithm, and allows the optimization of hyperparameters without a validation set. Our estimator is compatible with using data minibatches and scales linearly with the number of parameters, making it suitable for large-scale problems.

9.1 Acknowledgements 

We are grateful to Roger Grosse, Miguel Hernandez-Lobato, Matthew Johnson, and Oren Rippel for helpful discussions. We thank Analog Devices International and Samsung Advanced Institute of Technology for their support.

References 

Bai, Zhaojun, Fahey, Mark, and Golub, Gene. Some large-scale matrix computation problems. Journal of Computational and Applied Mathematics, 74(1):71-89, 1996.

Domke, Justin. Generic methods for optimization-based modeling. In International Conference on Artificial Intelligence and Statistics, pp. 318-326, 2012.

Graves, Alex. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348-2356, 2011.

Hensman, James and Lawrence, Neil D. Nested variational compression in deep Gaussian processes. arXiv preprint arXiv:1412.1370, 2014.

Hernandez-Lobato, Jose Miguel and Adams, Ryan P. Probabilistic backpropagation for scalable learning of bayesian neural networks. arXiv preprint arXiv:1502.05336, 2015.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, Diederik and Welling, Max. Efficient gradient-based inference through transformations between bayes nets and neural nets. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1782-1790, 2014.

Kucukelbir, Alp, Ranganath, Rajesh, Gelman, Andrew, and Blei, David. Fully automatic variational inference of differentiable probability models. In NIPS Workshop on Probabilistic Programming, 2014.

Ma, Jianzhu, Peng, Jian, Wang, Sheng, and Xu, Jinbo. Estimating the partition function of graphical models using langevin importance sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pp. 433-441, 2013.

MacKay, David JC. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448-472, 1992.

Maclaurin, Dougal, Duvenaud, David, and Adams, Ryan P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Neal, Radford M. MCMC using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.

Oliphant, Travis E. Python for scientific computing. Computing in Science & Engineering, 9(3):10-20, 2007.

Pearlmutter, Barak A. Fast exact multiplication by the Hessian. Neural computation, 6(1):147-160, 1994.

Raskutti, Garvesh, Wainwright, Martin J., and Yu, Bin. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. The Journal of Machine Learning Research, 15(1):335-366, 2014.

Salimans, Tim, Kingma, Diederik P, and Welling, Max. Markov chain Monte Carlo and variational inference: Bridging the gap. arXiv preprint arXiv:1410.6460, 2014.

Stein, Charles M. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135-1151, 1981.

Wainwright, Martin J and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

Welling, Max and Teh, Yee Whye. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681-688, 2011.



